library(tidyverse)
library(knitr)
library(plotly) ; library(viridis) ; library(gridExtra) ; library(RColorBrewer) ; library(ggpubr)
library(reshape2)
library(mgcv) # GAM
library(caret) ; library(DMwR) ; library(ROCR) ; library(car) ; library(MLmetrics)
library(knitr) ; library(kableExtra)
library(ROCR)
library(expss)
SFARI_colour_hue = function(r) {
pal = c('#FF7631','#FFB100','#E8E328','#8CC83F','#62CCA6','#59B9C9','#b3b3b3','#808080','gray','#d9d9d9')[r]
}
load('./../Data/preprocessed_data.RData')
datExpr = datExpr %>% data.frame
# Ridge Regression output
load('./../Data/Ridge_model.RData')
# SFARI Genes
SFARI_genes = read_csv('./../../../SFARI/Data/SFARI_genes_01-03-2020_w_ensembl_IDs.csv')
SFARI_genes = SFARI_genes[!duplicated(SFARI_genes$ID) & !is.na(SFARI_genes$ID),]
# GO Neuronal annotations: regex 'neuron' in GO functional annotations and label the genes that make a match as neuronal
GO_annotations = read.csv('./../Data/genes_GO_annotations.csv')
GO_neuronal = GO_annotations %>% filter(grepl('neuron', go_term)) %>%
mutate('ID'=as.character(ensembl_gene_id)) %>%
dplyr::select(-ensembl_gene_id) %>% distinct(ID) %>%
mutate('Neuronal'=1)
# Add all this info to predictions
biased_predictions = predictions %>% left_join(SFARI_genes %>% dplyr::select(ID, `gene-score`), by = 'ID') %>%
mutate(gene.score = ifelse(is.na(`gene-score`),
ifelse(ID %in% GO_neuronal$ID, 'Neuronal', 'Others'),
`gene-score`)) %>%
dplyr::select(-`gene-score`)
clustering_selected = 'DynamicHybrid'
clusterings = read_csv('./../Data/clusters.csv')
clusterings$Module = clusterings[,clustering_selected] %>% data.frame %>% unlist %>% unname
assigned_module = clusterings %>% dplyr::select(ID, Module)
rm(rownames_dataset, GO_annotations, datGenes, dds, clustering_selected,
clusterings)
As it can be seen in 20_04_08_Ridge.html, there is a relation between the Probability the model assigns to a gene and the gene’s mean level of expression. This is a problem because we had previously discovered a bias in the SFARI scores related to mean level of expression (Preprocessing/Gandal/AllRegions/RMarkdowns/20_04_03_SFARI_genes.html), which means that this could be a confounding factor in our model and the reason why it seems to perform well, so we need to remove this bias to recover the true biological signal that is mixed with it and improve the quality of our model.
train model with equal weights for all samples
for l in loop:
calculate bias
correct weights to reduce bias
retrain model
Return last model
Parameters:
eta: Learning rate
T: Number of loops
D: Training data
H: Classification model
c: bias constraint
lambda: scaling factor for the weights
\(w_i\) with \(i=1,...,N\): Weights assigned to each sample
Pseudocode:
lambda = 0
w = [1, ..., 1]
c = std(meanExpr(D))
h = train classifier H with lambda and w
for t in 1,,,T do
bias = <h(x), c(x)>
update lambda to lambda - eta*bias
update weights_hat to exp(lambda*mean(c))
update weights to w_hat/(1+w_hat) if y_i=1, 1/(1+w_hat) if y_i=0
update h with new weights
Return h
Using Demographic Parity as a measure of bias: A fair classifier h should make positive predictions each segment \(G\) of the population at the same rate as in all of the population
This definition is for discrete segments of the population. Since our bias is found across all the population but in different measures depending on the mean level of expression of the gene, we have to adapt this definition to a continuous bias scenario
Demographic Parity for our problem: A fair classifier h should make positive predictions on genes with a certail mean level of expression at the same rate as in all of the genes in the dataset
The original formula for the Demographic Parity bias is
$c(x,0) = 0 $ when the prediction is negative
\(c(x,1) = \frac{g(x)}{Z_G}-1\) when the prediction is positive. Where \(g(x)\) is the Kronecker delta to indicate if the sample belongs to the protected group and \(Z_G\) is the proportion of the population that belongs to the group we want to protect against bias
Using this definitions in our problem:
\(g(x):\) Since all our samples belong to the protected group, this would always be 1
\(Z_G:\) Since all of our samples belong to the protected group, this would also always be 1
So our measure of bias \(c(x,1) = \frac{1}{1}-1 = 0\) for all samples. This doesn’t work, so we need to adapt it to our continous case
We can use \(c(x,1) = std(meanExpr(x))\) as the constraint function, this way, when we calculate the bias of the dataset:
\(h(x)\cdot c(x)\) will only be zero if the positive samples are balanced around the mean expression, and the sign of the bias will indicate the direction of the bias
Notes:
This model is only going to be used to obtain the optimal weights for the observations. The weights obtained in this model are going to be the ones used to train the final model afterwards
In the original model we would use oversampling as part of the training function, but here, since we are adding weights to the samples, the oversampling will have to be performed separately (otherwise the weights don’t match the number of observations), so it was moved inside the create_train_test_sets function
### DEFINE FUNCTIONS
create_train_test_sets = function(p, seed){
# Get SFARI Score of all the samples so our train and test sets are balanced for each score
sample_scores = dataset %>% mutate(ID = rownames(.)) %>% dplyr::select(ID) %>%
left_join(biased_predictions %>% dplyr::select(ID, gene.score), by = 'ID') %>%
mutate(gene.score = ifelse(is.na(gene.score), 'None', gene.score))
set.seed(seed)
train_idx = createDataPartition(sample_scores$gene.score, p = p, list = FALSE)
train_set = dataset[train_idx,]
test_set = dataset[-train_idx,]
# Modify SFARI label in train set, save gene IDS (bc we lose them with SMOTE) and perform oversampling using SMOTE
train_set = train_set %>% mutate(SFARI = ifelse(SFARI == TRUE, 'SFARI', 'not_SFARI') %>% as.factor,
ID = rownames(.) %>% as.factor) %>% SMOTE(form = SFARI ~ . - ID)
train_set_IDs = train_set %>% pull(ID)
return(list('train_set' = train_set %>% dplyr::select(-ID), 'test_set' = test_set,
'train_set_IDs' = train_set_IDs))
}
run_weights_model = function(p, seed, Loops){
# CREATE TRAIN AND TEST SETS
train_test_sets = create_train_test_sets(p, seed)
train_set = train_test_sets[['train_set']]
test_set = train_test_sets[['test_set']]
train_set_IDs = train_test_sets[['train_set_IDs']]
# SET INITIAL PARAMETERS
# General parameters
set.seed(seed)
lambda_seq = 10^seq(1, -4, by = -.1)
k_fold = 10
cv_repeats = 5
trControl = trainControl(method = 'repeatedcv', number = k_fold, repeats = cv_repeats, verboseIter = FALSE,
classProbs = TRUE, savePredictions = 'final', summaryFunction = twoClassSummary)
# Bias correction parameters
eta = 0.5
lambda = 0
w = rep(1, nrow(train_set))
# TRAIN MODEL
h = train(SFARI ~., data = train_set, method = 'glmnet', trControl = trControl, metric = 'ROC',
tuneGrid = expand.grid(alpha = 0, lambda = lambda_seq))
# CORRECT BIAS
# Mean Expression info
mean_expr = data.frame('ID' = train_set_IDs) %>%
left_join(data.frame('ID' = rownames(datExpr), 'meanExpr' = rowMeans(datExpr)), by = 'ID') %>%
mutate('meanExpr_std' = (meanExpr-mean(meanExpr))/sd(meanExpr))
# Track behaviour of plot
bias_vec = c()
acc_vec = c()
for(l in 1:Loops){
# Calculate bias for positive predicted samples
bias = mean(mean_expr$meanExpr_std[predict(h, train_set)=='SFARI'])
# Update weights
lambda = lambda - eta*bias
w_hat = exp(lambda*mean_expr$meanExpr_std)
w = 1/(1+w_hat)
w[train_set$SFARI=='SFARI'] = w[train_set$SFARI=='SFARI']*w_hat[train_set$SFARI=='SFARI']
# Update tracking vars
bias_vec = c(bias_vec, bias)
acc_vec = c(acc_vec, mean(predict(h, train_set) == train_set$SFARI))
# Update h
h = train(SFARI ~., data = train_set, method = 'glmnet', weights = w, trControl = trControl,
metric = 'ROC', tuneGrid = expand.grid(alpha = 0, lambda = lambda_seq))
}
return(list('lambda' = lambda, 'bias_vec' = bias_vec, 'acc_vec' = acc_vec))
}
### RUN MODEL TO FIND OPTIMAL WEIGHTS
# Parameters
p = 0.75
seed = 123
Loops = 50
# Run model
model_output = run_weights_model(p, seed, Loops)
# Extract metrics
lambda = model_output[['lambda']]
bias_vec = model_output[['bias_vec']]
acc_vec = model_output[['acc_vec']]
rm(p, seeds, Loops, run_weights_model)
The optimal value of \(\lambda\) is -1.7419319
The bias decreases until it oscilates around zero and the accuracy is not affected much
plot_info = data.frame('iter' = 1:length(bias_vec), 'bias' = bias_vec, 'accuracy' = acc_vec) %>%
melt(id.vars = 'iter')
plot_info %>% ggplot(aes(x=iter, y=value, color = variable)) + geom_line() + xlab('Iteration') + theme_minimal()
Since the bias increases the probability of being classified as 1 for genes with higher levels of expression, as the level of expression of a gene increases, the algorithm:
Increases the weight of genes with a negative label
Decreases the weight of genes with a positive label
mean_expr = data.frame('ID' = rownames(datExpr), 'meanExpr' = rowMeans(datExpr)) %>%
left_join(predictions, by = 'ID') %>% filter(n>0) %>%
mutate('meanExpr_std' = (meanExpr-mean(meanExpr))/sd(meanExpr))
w_hat = exp(lambda*mean_expr$meanExpr_std) # inverso a mean expr
w0 = 1/(1+w_hat) # prop a mean expr
w = 1/(1+w_hat)
w[mean_expr$SFARI %>% as.logical] = w[mean_expr$SFARI %>% as.logical]*w_hat[mean_expr$SFARI %>% as.logical] # inv mean expr Positives, prop Negatives
plot_data = data.frame('meanExpr' = mean_expr$meanExpr, 'w_hat' = w_hat, 'w0' = w0, 'w' = w,
'SFARI' = mean_expr$SFARI, 'pred' = mean_expr$pred)
plot_data %>% ggplot(aes(meanExpr, w, color = SFARI)) + geom_point(alpha = 0.3) +
xlab('Mean Expression') + ylab('Weight') +
ggtitle('Weights of the final model') + ylim(c(0,1)) + theme_minimal()
rm(mean_expr, w_hat, w0, w)
run_final_model = function(p, seed, lambda){
# CREATE TRAIN AND TEST SETS
train_test_sets = create_train_test_sets(p, seed)
train_set = train_test_sets[['train_set']]
test_set = train_test_sets[['test_set']]
train_set_IDs = train_test_sets[['train_set_IDs']]
# SET INITIAL PARAMETERS
# General parameters
set.seed(seed)
lambda_seq = 10^seq(1, -4, by = -.1)
k_fold = 10
cv_repeats = 5
trControl = trainControl(method = 'repeatedcv', number = k_fold, repeats = cv_repeats, verboseIter = FALSE,
classProbs = TRUE, savePredictions = 'final', summaryFunction = twoClassSummary)
# Bias correcting parameters
mean_expr = data.frame('ID' = train_set_IDs) %>%
left_join(data.frame('ID' = rownames(datExpr), 'meanExpr' = rowMeans(datExpr)), by = 'ID') %>%
mutate('meanExpr_std' = (meanExpr-mean(meanExpr))/sd(meanExpr))
w_hat = exp(lambda*mean_expr$meanExpr_std)
w = 1/(1+w_hat)
w[train_set$SFARI=='SFARI'] = w[train_set$SFARI=='SFARI']*w_hat[train_set$SFARI=='SFARI']
# TRAIN MODEL
fit = train(SFARI ~., data = train_set, method = 'glmnet', weights = w, trControl = trControl,
metric = 'ROC', tuneGrid = expand.grid(alpha = 0, lambda = lambda_seq))
# PREDICT TEST SET LABELS AND CREATE PERFORMANCE METRICS
# Predict labels in test set
predictions = fit %>% predict(test_set, type = 'prob')
preds = data.frame('ID' = rownames(test_set), 'corrected_prob' = predictions$SFARI) %>%
mutate(corrected_pred = corrected_prob > 0.5)
# Measure performance of the model
acc = mean(test_set$SFARI==preds$corrected_pred)
prec = Precision(test_set$SFARI %>% as.numeric, preds$corrected_pred %>% as.numeric, positive = '1')
rec = Recall(test_set$SFARI %>% as.numeric, preds$corrected_pred %>% as.numeric, positive = '1')
F1 = F1_Score(test_set$SFARI %>% as.numeric, preds$corrected_pred %>% as.numeric, positive = '1')
pred_ROCR = prediction(preds$corrected_prob, test_set$SFARI)
AUC = performance(pred_ROCR, measure='auc')@y.values[[1]]
# Extract coefficients from features
coefs = coef(fit$finalModel, fit$bestTune$lambda) %>% as.vector
return(list('acc' = acc, 'prec' = prec, 'rec' = rec, 'F1' = F1, 'AUC' = AUC, 'preds' = preds, 'coefs'= coefs))
}
### RUN MODEL
# Parameters
p = 0.75
n_iter = 25
seeds = 123:(123+n_iter-1)
# So the input is the same as in 10_classification_model.html
original_dataset = dataset %>% mutate(ID = rownames(.)) %>%
left_join(biased_predictions %>% dplyr::select(ID, gene.score))
# Store outputs
acc = c()
prec = c()
rec = c()
F1 = c()
AUC = c()
predictions = data.frame('ID' = rownames(dataset), 'SFARI' = dataset$SFARI, 'corrected_prob' = 0,
'corrected_pred' = 0, 'n' = 0)
coefs = data.frame('var' = c('Intercept', colnames(dataset[,-ncol(dataset)])), 'coef' = 0)
for(seed in seeds){
# Run model
model_output = run_final_model(p, seed, lambda)
# Update outputs
acc = c(acc, model_output[['acc']])
prec = c(prec, model_output[['prec']])
rec = c(rec, model_output[['rec']])
F1 = c(F1, model_output[['F1']])
AUC = c(AUC, model_output[['AUC']])
preds = model_output[['preds']]
coefs$coef = coefs$coef + model_output[['coefs']]
update_preds = preds %>% dplyr::select(-ID) %>% mutate(n=1)
predictions[predictions$ID %in% preds$ID, c('corrected_prob','corrected_pred','n')] =
predictions[predictions$ID %in% preds$ID, c('corrected_prob','corrected_pred','n')] +
update_preds
}
coefs = coefs %>% mutate(coef = coef/n_iter)
predictions = predictions %>% mutate(corrected_prob = corrected_prob/n, corrected_pred_count = corrected_pred,
corrected_pred = corrected_prob>0.5) %>%
left_join(biased_predictions %>% dplyr::select(ID, prob, pred), by = 'ID')
rm(p, seeds, update_preds, create_train_test_sets, run_final_model)
test_set = predictions %>% filter(n>0) %>%
left_join(dataset %>% mutate(ID = rownames(.)) %>% dplyr::select(ID, GS, MTcor), by = 'ID')
rownames(test_set) = predictions$ID[predictions$n>0]
The relation between the model probability and the mean level of expression of the genes is not completely gone, there seems to be a negative relation for the genes with the lowest levels of expression
Even though the trend line is not as flat as with the first method, we are not fixing this directly as we were doing before, this is now just a consequence of the corrections we did inside of the model, so it makes sense for it to be less exact than before
# Plot results
plot_data = data.frame('ID'=rownames(datExpr), 'meanExpr'=rowMeans(datExpr)) %>%
right_join(test_set, by='ID')
plot_data %>% ggplot(aes(meanExpr, corrected_prob)) + geom_point(alpha=0.2, color='#0099cc') +
geom_smooth(method='gam', color='gray', alpha=0.2) +
xlab('Mean Expression') + ylab('Corrected Probability') +
theme_minimal() + ggtitle('Mean expression vs Model Probability corrected using adjusted Weights')
conf_mat = test_set %>% apply_labels(SFARI = 'Actual Labels',
corrected_prob = 'Assigned Probability',
corrected_pred = 'Label Prediction')
cro(conf_mat$SFARI, list(conf_mat$corrected_pred, total()))
| Label Prediction | #Total | |||
|---|---|---|---|---|
| FALSE | TRUE | |||
| Actual Labels | ||||
| FALSE | 12851 | 2351 | 15202 | |
| TRUE | 559 | 221 | 780 | |
| #Total cases | 13410 | 2572 | 15982 | |
rm(conf_mat)
pred_ROCR = prediction(test_set$corrected_prob, test_set$SFARI)
roc_ROCR = performance(pred_ROCR, measure='tpr', x.measure='fpr')
auc = performance(pred_ROCR, measure='auc')@y.values[[1]]
plot(roc_ROCR, main=paste0('ROC curve (AUC=',round(auc,2),')'), col='#009999')
abline(a=0, b=1, col='#666666')
lift_ROCR = performance(pred_ROCR, measure='lift', x.measure='rpp')
plot(lift_ROCR, main='Lift curve', col='#86b300')
rm(pred_ROCR, roc_ROCR, AUC, lift_ROCR, acc, acc_vec, auc, bias_vec, F1, prec, rec)
MTcor and aGS have a very small coefficient and absGGS has the largest negative coefficient of all
gene_corr_info = dataset %>% mutate('ID' = rownames(dataset)) %>% dplyr::select(ID, MTcor, SFARI) %>%
left_join(assigned_module, by ='ID') %>% mutate(Module = gsub('#','',Module))
coef_info = coefs %>% mutate('feature' = gsub('MM.','',var)) %>%
left_join(gene_corr_info, by = c('feature' = 'Module')) %>%
dplyr::select(feature, coef, MTcor, SFARI) %>% group_by(feature, coef, MTcor) %>%
summarise('SFARI_perc' = mean(SFARI)) %>% arrange(desc(coef))
coef_info %>% dplyr::select(feature, coef) %>% filter(feature %in% c('Intercept','GS','absGS','MTcor')) %>%
dplyr::rename('Feature' = feature, 'Coefficient' = coef) %>%
kable(align = 'cc', caption = 'Regression Coefficients') %>% kable_styling(full_width = F)
| Feature | Coefficient |
|---|---|
| GS | 0.1132241 |
| MTcor | 0.0026661 |
| Intercept | -0.2344033 |
| absGS | -0.6040255 |
load('./../Data/ORA.RData')
enrichment_SFARI_info = data.frame('Module'=as.character(), 'SFARI_enrichment'=as.numeric())
for(m in names(enrichment_SFARI)){
m_info = enrichment_SFARI[[m]]
enrichment = 1-ifelse('SFARI' %in% m_info$ID, m_info$pvalue[m_info$ID=='SFARI'],1)
enrichment_SFARI_info = enrichment_SFARI_info %>%
add_row(Module = gsub('#','',m), SFARI_enrichment = enrichment)
}
plot_data = coef_info %>% dplyr::rename('Module' = feature) %>%
left_join(enrichment_SFARI_info, by = 'Module') %>% filter(!is.na(MTcor))
ggplotly(plot_data %>% ggplot(aes(coef, SFARI_enrichment)) +
geom_smooth(method = 'lm', color = 'gray', alpha = 0.1) +
geom_point(aes(id = Module), color = paste0('#',plot_data$Module), alpha=0.7) +
theme_minimal() + xlab('Coefficient') +
ylab('SFARI Genes Enrichment'))
rm(enrichment_old_SFARI, enrichment_DGN, enrichment_DO, enrichment_GO, enrichment_KEGG, enrichment_Reactome, m,
m_info, enrichment)
For the first time, there seems to be a possitive relation between the coefficients of the model and the correlation of the module and the diagnosis
ggplotly(coef_info %>% dplyr::rename('Module' = feature) %>% filter(!is.na(MTcor)) %>%
ggplot(aes(coef, MTcor)) + geom_smooth(method = 'lm', color = 'gray', alpha = 0.1) +
geom_point(aes(id = Module), color=paste0('#',coef_info$feature[!is.na(coef_info$MTcor)]), alpha=.7) +
theme_minimal() + xlab('Coefficient') + ylab('Module-Diagnosis correlation'))
SFARI genes have a higher Probability distribution than the rest, but the overlap is larger than before
plot_data = test_set %>% dplyr::select(corrected_prob, SFARI)
ggplotly(plot_data %>% ggplot(aes(corrected_prob, fill=SFARI, color=SFARI)) + geom_density(alpha=0.3) +
geom_vline(xintercept = mean(plot_data$corrected_prob[plot_data$SFARI]), color = '#00C0C2',
linetype='dashed') +
geom_vline(xintercept = mean(plot_data$corrected_prob[!plot_data$SFARI]), color = '#FF7371',
linetype='dashed') +
xlab('Score') + ggtitle('Model Probability distribution by SFARI Label') + theme_minimal())
The relation between probability and SFARI Gene Scores weakened but it’s still there
plot_data = test_set %>% mutate(ID=rownames(test_set)) %>% dplyr::select(ID, corrected_prob) %>%
left_join(original_dataset, by='ID') %>% dplyr::select(ID, corrected_prob, gene.score) %>%
apply_labels(gene.score='SFARI Gene score')
cro(plot_data$gene.score)
| #Total | |
|---|---|
| SFARI Gene score | |
| 1 | 144 |
| 2 | 205 |
| 3 | 430 |
| Neuronal | 927 |
| Others | 14261 |
| #Total cases | 15967 |
mean_vals = plot_data %>% group_by(gene.score) %>% summarise(mean_prob = mean(corrected_prob))
comparisons = list(c('1','2'), c('2','3'), c('3','Neuronal'), c('Neuronal','Others'),
c('1','3'), c('3','Others'), c('2','Neuronal'),
c('1','Neuronal'), c('2','Others'), c('1','Others'))
increase = 0.07
base = 0.75
pos_y_comparisons = c(rep(base, 4), rep(base + increase, 2), base + 2:5*increase)
plot_data %>% filter(!is.na(gene.score)) %>% ggplot(aes(gene.score, corrected_prob, fill=gene.score)) +
geom_boxplot(outlier.colour='#cccccc', outlier.shape='o', outlier.size=3) +
stat_compare_means(comparisons = comparisons, label = 'p.signif', method = 't.test',
method.args = list(var.equal = FALSE), label.y = pos_y_comparisons,
tip.length = .02) +
scale_fill_manual(values=SFARI_colour_hue(r=c(1:3,8,7))) +
ggtitle('Distribution of probabilities by SFARI score') +
xlab('SFARI score') + ylab('Probability') + theme_minimal() + theme(legend.position = 'none')
rm(mean_vals, increase, base, pos_y_comparisons)
The concentration of SFARI genes decrease from 1:3 to 1:6
The genes with the highest probabilities are no longer SFARI Genes
test_set %>% dplyr::select(corrected_prob, SFARI) %>% mutate(ID = rownames(test_set)) %>%
arrange(desc(corrected_prob)) %>% top_n(50, wt=corrected_prob) %>%
left_join(biased_predictions %>% dplyr::select(ID, gene.score, external_gene_id, MTcor, GS),
by = 'ID') %>%
dplyr::rename('GeneSymbol' = external_gene_id, 'Probability' = corrected_prob,
'ModuleDiagnosis_corr' = MTcor, 'GeneSignificance' = GS) %>%
mutate(ModuleDiagnosis_corr = round(ModuleDiagnosis_corr,4), Probability = round(Probability,4),
GeneSignificance = round(GeneSignificance,4)) %>%
left_join(assigned_module, by = 'ID') %>%
dplyr::select(GeneSymbol, GeneSignificance, ModuleDiagnosis_corr, Module, Probability,
gene.score) %>%
kable(caption = 'Genes with highest model probabilities from the test set') %>%
kable_styling(full_width = F)
| GeneSymbol | GeneSignificance | ModuleDiagnosis_corr | Module | Probability | gene.score |
|---|---|---|---|---|---|
| CLMP | 0.0360 | -0.0127 | #00BF7D | 0.7478 | Others |
| SNX25 | 0.0969 | -0.0127 | #00BF7D | 0.7186 | Others |
| HIVEP2 | 0.0188 | -0.9003 | #00BADE | 0.7103 | 1 |
| EGR1 | 0.2173 | 0.5341 | #EF67EB | 0.6962 | Neuronal |
| CORO1A | 0.2003 | 0.5341 | #EF67EB | 0.6836 | Others |
| ARC | 0.2509 | 0.5341 | #EF67EB | 0.6828 | Neuronal |
| VPS13B | 0.3442 | 0.6175 | #44A0FF | 0.6808 | 1 |
| ITGAX | -0.0070 | -0.0127 | #00BF7D | 0.6790 | Others |
| HECTD2 | 0.1857 | 0.5341 | #EF67EB | 0.6776 | Others |
| PLXNC1 | -0.0093 | 0.5341 | #EF67EB | 0.6757 | Others |
| PER2 | 0.1483 | 0.5341 | #EF67EB | 0.6756 | 2 |
| TBC1D8 | 0.0962 | 0.7206 | #FF66A9 | 0.6740 | Others |
| SERPING1 | 0.2974 | 0.8479 | #00C19A | 0.6734 | Others |
| PARVG | 0.5663 | 0.7206 | #FF66A9 | 0.6720 | Others |
| PTPRO | 0.1180 | 0.5341 | #EF67EB | 0.6713 | Neuronal |
| IKZF1 | 0.5540 | 0.7206 | #FF66A9 | 0.6711 | Others |
| ITGAM | 0.2995 | 0.7206 | #FF66A9 | 0.6698 | Others |
| EGR2 | 0.4858 | 0.5341 | #EF67EB | 0.6696 | Neuronal |
| ARHGAP20 | -0.0198 | -0.9003 | #00BADE | 0.6687 | Others |
| BDNF | 0.3529 | 0.5341 | #EF67EB | 0.6686 | Neuronal |
| NR4A3 | 0.4253 | 0.5341 | #EF67EB | 0.6671 | Neuronal |
| SORCS3 | 0.0768 | 0.5341 | #EF67EB | 0.6656 | Others |
| CSF1 | 0.6118 | 0.8479 | #00C19A | 0.6634 | Others |
| SETD5 | 0.2826 | 0.6175 | #44A0FF | 0.6632 | 1 |
| TRIM9 | 0.4827 | 0.5341 | #EF67EB | 0.6620 | Others |
| AHI1 | 0.0158 | -0.0127 | #00BF7D | 0.6614 | Others |
| UNC79 | -0.0418 | -0.0127 | #00BF7D | 0.6608 | 2 |
| ZBTB21 | 0.3410 | 0.5341 | #EF67EB | 0.6602 | Others |
| ARPP21 | 0.4189 | 0.7651 | #77AF00 | 0.6598 | Others |
| SCAF4 | 0.0161 | -0.5573 | #00C0BF | 0.6595 | Others |
| NCOR1 | 0.4663 | 0.5341 | #EF67EB | 0.6590 | 2 |
| EDC4 | 0.0701 | -0.7233 | #FF63B6 | 0.6578 | Others |
| LRRC7 | 0.1828 | -0.9003 | #00BADE | 0.6565 | Others |
| BACH2 | 0.2821 | 0.5341 | #EF67EB | 0.6561 | Others |
| AOAH | 0.0888 | 0.7206 | #FF66A9 | 0.6552 | Others |
| SIK2 | 0.1608 | -0.7233 | #FF63B6 | 0.6549 | Others |
| SAMSN1 | 0.4237 | 0.7206 | #FF66A9 | 0.6546 | Others |
| TP53BP1 | 0.4200 | 0.5341 | #EF67EB | 0.6543 | Others |
| CSMD3 | 0.0773 | 0.3302 | #F47B5B | 0.6543 | Others |
| CELF2 | -0.0655 | -0.9003 | #00BADE | 0.6536 | Others |
| RAPH1 | -0.0554 | -0.3847 | #00B7E7 | 0.6525 | Others |
| PACSIN3 | -0.0813 | -0.9003 | #00BADE | 0.6524 | Others |
| GADL1 | 0.6016 | 0.8479 | #00C19A | 0.6517 | Others |
| PSME4 | 0.2117 | 0.5341 | #EF67EB | 0.6515 | Others |
| RIMBP2 | 0.2592 | 0.5341 | #EF67EB | 0.6511 | Others |
| TET3 | 0.4848 | 0.5341 | #EF67EB | 0.6506 | Others |
| DOCK2 | 0.4748 | 0.7206 | #FF66A9 | 0.6497 | Others |
| MIDN | -0.0560 | -0.5573 | #00C0BF | 0.6494 | Others |
| CTTNBP2 | 0.1711 | 0.6175 | #44A0FF | 0.6492 | 2 |
| CD72 | 0.2570 | 0.5341 | #EF67EB | 0.6490 | Others |
The objective of this model is to identify candidate SFARI genes. For this, we are going to focus on the negative samples (the non-SFARI genes)
negative_set = test_set %>% filter(!SFARI)
negative_set_table = negative_set %>% apply_labels(corrected_prob = 'Assigned Probability',
corrected_pred = 'Label Prediction')
cro(negative_set_table$corrected_pred)
| #Total | |
|---|---|
| Label Prediction | |
| FALSE | 12851 |
| TRUE | 2351 |
| #Total cases | 15202 |
NA genes are predicted as ASD-related
negative_set %>% ggplot(aes(corrected_prob)) + geom_density(color='#F8766D', fill='#F8766D', alpha=0.5) +
geom_vline(xintercept=0.5, color='#333333', linetype='dotted') + xlab('Probability') +
ggtitle('Probability distribution of the Negative samples in the Test Set') +
theme_minimal()
negative_set %>% dplyr::select(corrected_prob, SFARI) %>% mutate(ID = rownames(negative_set)) %>%
arrange(desc(corrected_prob)) %>% top_n(50, wt=corrected_prob) %>%
left_join(biased_predictions %>% dplyr::select(ID, gene.score, external_gene_id, MTcor, GS),
by = 'ID') %>%
dplyr::rename('GeneSymbol' = external_gene_id, 'Probability' = corrected_prob,
'ModuleDiagnosis_corr' = MTcor, 'GeneSignificance' = GS) %>%
mutate(ModuleDiagnosis_corr = round(ModuleDiagnosis_corr,4),
Probability = round(Probability,4),
GeneSignificance = round(GeneSignificance,4)) %>%
left_join(assigned_module, by = 'ID') %>%
dplyr::select(GeneSymbol, GeneSignificance, ModuleDiagnosis_corr, Module, Probability,
gene.score) %>%
kable(caption = 'Genes with highest model probabilities from the Negative set') %>%
kable_styling(full_width = F)
| GeneSymbol | GeneSignificance | ModuleDiagnosis_corr | Module | Probability | gene.score |
|---|---|---|---|---|---|
| CLMP | 0.0360 | -0.0127 | #00BF7D | 0.7478 | Others |
| SNX25 | 0.0969 | -0.0127 | #00BF7D | 0.7186 | Others |
| EGR1 | 0.2173 | 0.5341 | #EF67EB | 0.6962 | Neuronal |
| CORO1A | 0.2003 | 0.5341 | #EF67EB | 0.6836 | Others |
| ARC | 0.2509 | 0.5341 | #EF67EB | 0.6828 | Neuronal |
| ITGAX | -0.0070 | -0.0127 | #00BF7D | 0.6790 | Others |
| HECTD2 | 0.1857 | 0.5341 | #EF67EB | 0.6776 | Others |
| PLXNC1 | -0.0093 | 0.5341 | #EF67EB | 0.6757 | Others |
| TBC1D8 | 0.0962 | 0.7206 | #FF66A9 | 0.6740 | Others |
| SERPING1 | 0.2974 | 0.8479 | #00C19A | 0.6734 | Others |
| PARVG | 0.5663 | 0.7206 | #FF66A9 | 0.6720 | Others |
| PTPRO | 0.1180 | 0.5341 | #EF67EB | 0.6713 | Neuronal |
| IKZF1 | 0.5540 | 0.7206 | #FF66A9 | 0.6711 | Others |
| ITGAM | 0.2995 | 0.7206 | #FF66A9 | 0.6698 | Others |
| EGR2 | 0.4858 | 0.5341 | #EF67EB | 0.6696 | Neuronal |
| ARHGAP20 | -0.0198 | -0.9003 | #00BADE | 0.6687 | Others |
| BDNF | 0.3529 | 0.5341 | #EF67EB | 0.6686 | Neuronal |
| NR4A3 | 0.4253 | 0.5341 | #EF67EB | 0.6671 | Neuronal |
| SORCS3 | 0.0768 | 0.5341 | #EF67EB | 0.6656 | Others |
| CSF1 | 0.6118 | 0.8479 | #00C19A | 0.6634 | Others |
| TRIM9 | 0.4827 | 0.5341 | #EF67EB | 0.6620 | Others |
| AHI1 | 0.0158 | -0.0127 | #00BF7D | 0.6614 | Others |
| ZBTB21 | 0.3410 | 0.5341 | #EF67EB | 0.6602 | Others |
| ARPP21 | 0.4189 | 0.7651 | #77AF00 | 0.6598 | Others |
| SCAF4 | 0.0161 | -0.5573 | #00C0BF | 0.6595 | Others |
| EDC4 | 0.0701 | -0.7233 | #FF63B6 | 0.6578 | Others |
| LRRC7 | 0.1828 | -0.9003 | #00BADE | 0.6565 | Others |
| BACH2 | 0.2821 | 0.5341 | #EF67EB | 0.6561 | Others |
| AOAH | 0.0888 | 0.7206 | #FF66A9 | 0.6552 | Others |
| SIK2 | 0.1608 | -0.7233 | #FF63B6 | 0.6549 | Others |
| SAMSN1 | 0.4237 | 0.7206 | #FF66A9 | 0.6546 | Others |
| TP53BP1 | 0.4200 | 0.5341 | #EF67EB | 0.6543 | Others |
| CSMD3 | 0.0773 | 0.3302 | #F47B5B | 0.6543 | Others |
| CELF2 | -0.0655 | -0.9003 | #00BADE | 0.6536 | Others |
| RAPH1 | -0.0554 | -0.3847 | #00B7E7 | 0.6525 | Others |
| PACSIN3 | -0.0813 | -0.9003 | #00BADE | 0.6524 | Others |
| GADL1 | 0.6016 | 0.8479 | #00C19A | 0.6517 | Others |
| PSME4 | 0.2117 | 0.5341 | #EF67EB | 0.6515 | Others |
| RIMBP2 | 0.2592 | 0.5341 | #EF67EB | 0.6511 | Others |
| TET3 | 0.4848 | 0.5341 | #EF67EB | 0.6506 | Others |
| DOCK2 | 0.4748 | 0.7206 | #FF66A9 | 0.6497 | Others |
| MIDN | -0.0560 | -0.5573 | #00C0BF | 0.6494 | Others |
| CD72 | 0.2570 | 0.5341 | #EF67EB | 0.6490 | Others |
| GIT2 | -0.1432 | -0.0127 | #00BF7D | 0.6480 | Others |
| DNAJB5 | 0.5220 | 0.5341 | #EF67EB | 0.6469 | Others |
| DUSP6 | 0.4091 | 0.5341 | #EF67EB | 0.6463 | Others |
| ANKRD36 | -0.0351 | -0.0127 | #00BF7D | 0.6448 | Others |
| NR4A1 | 0.4369 | 0.9612 | #88AC00 | 0.6447 | Others |
| CD37 | 0.4633 | 0.7206 | #FF66A9 | 0.6442 | Others |
| DDX27 | 0.2347 | -0.0127 | #00BF7D | 0.6441 | Others |
Comparison with the original model’s probabilities:
The genes with the highest Probabilitys were affected the most as a group
Iin general genes with the lowest Probabilities got their score increased and the genes with the highest scores, decreased
The change in Probability by gene is much larger than with the Post Processing approach
negative_set %>% mutate(diff = abs(prob-corrected_prob)) %>%
ggplot(aes(prob, corrected_prob, color = diff)) + geom_point(alpha=0.2) + scale_color_viridis() +
geom_abline(slope=1, intercept=0, color='gray', linetype='dashed') +
geom_smooth(color='#666666', alpha=0.5, se=TRUE, size=0.5) + coord_fixed() +
xlab('Original probability') + ylab('Corrected probability') + theme_minimal() + theme(legend.position = 'none')
negative_set_table = negative_set %>% apply_labels(corrected_prob = 'Corrected Probability',
corrected_pred = 'Corrected Class Prediction',
pred = 'Original Class Prediction') %>%
filter(!is.na(pred))
cro(negative_set_table$pred, list(negative_set_table$corrected_pred, total()))
| Corrected Class Prediction | #Total | |||
|---|---|---|---|---|
| FALSE | TRUE | |||
| Original Class Prediction | ||||
| FALSE | 11168 | 949 | 12117 | |
| TRUE | 1671 | 1400 | 3071 | |
| #Total cases | 12839 | 2349 | 15188 | |
83% of the genes maintained their original predicted class
rm(negative_set_table)
The relation is the opposite as before, the higher the Gene Significance, the higher the probability, with the lowest probabilities corresponding to under-expressed genes
*The transparent verison of the trend line is the original trend line
negative_set %>% ggplot(aes(corrected_prob, GS, color=MTcor)) + geom_point() +
geom_smooth(method='gam', color='#666666') + ylab('Gene Significance') +
geom_line(stat='smooth', method='gam', color='#666666', alpha=0.5, size=1.2, aes(x=prob)) +
geom_hline(yintercept=mean(negative_set$GS), color='gray', linetype='dashed') +
scale_color_gradientn(colours=c('#F8766D','white','#00BFC4')) + xlab('Corrected Probability') +
ggtitle('Relation between the Model\'s Corrected Probability and Gene Significance') +
theme_minimal()
Summarised version of Probability vs mean expression, plotting by module instead of by gene
The difference in the trend lines between this plot and the one above is that the one above takes all the points into consideration while this considers each module as an observation by itself, so the top one is strongly affected by big modules and the bottom one treats all modules the same
The transparent version of each point and trend lines are the original values and trends before the bias correction
plot_data = negative_set %>% mutate(ID = rownames(.)) %>% left_join(assigned_module, by = 'ID') %>%
group_by(MTcor, Module) %>% summarise(mean = mean(prob), sd = sd(prob),
new_mean = mean(corrected_prob),
new_sd = sd(corrected_prob), n = n()) %>%
mutate(MTcor_sign = ifelse(MTcor>0, 'Positive', 'Negative')) %>%
dplyr::select(Module, MTcor, MTcor_sign, mean, new_mean, sd, new_sd, n) %>% distinct()
colnames(plot_data)[1] = 'ID'
ggplotly(plot_data %>% ggplot(aes(MTcor, new_mean, size=n, color=MTcor_sign)) + geom_point(aes(id = ID)) +
geom_smooth(method='loess', color='gray', se=FALSE) + geom_smooth(method='lm', se=FALSE) +
geom_point(aes(y=mean), alpha=0.3) +
geom_line(stat='smooth', method='loess', color='gray', se=FALSE, alpha=0.3, size=1.2, aes(y=mean)) +
geom_line(stat='smooth', method='lm', se=FALSE, alpha=0.3, size=1.2, aes(y=mean)) +
xlab('Module-Diagnosis correlation') + ylab('Mean Corrected Probability by Module') +
theme_minimal() + theme(legend.position='none'))
To check if correcting by gene also corrected by module: Yes, the bias seems to be removed completely, it may even be a bit negative now
mean_and_sd = data.frame(ID=rownames(datExpr), meanExpr=rowMeans(datExpr), sdExpr=apply(datExpr,1,sd))
plot_data = negative_set %>% mutate(ID=rownames(test_set)[!test_set$SFARI]) %>%
left_join(mean_and_sd, by='ID') %>%
left_join(assigned_module, by='ID')
plot_data2 = plot_data %>% group_by(Module) %>% summarise(meanExpr = mean(meanExpr), meanProb = mean(prob),
new_meanProb = mean(corrected_prob), n=n())
ggplotly(plot_data2 %>% ggplot(aes(meanExpr, new_meanProb, size=n)) +
geom_point(color=plot_data2$Module) + geom_point(color=plot_data2$Module, alpha=0.3, aes(y=meanProb)) +
geom_smooth(method='loess', se=TRUE, color='gray', alpha=0.1, size=0.7) +
geom_line(stat='smooth', method='loess', se=TRUE, color='gray', alpha=0.4, size=1.2, aes(y=meanProb)) +
xlab('Mean Expression') + ylab('Corrected Probability') +
ggtitle('Mean expression vs corrected Model Probability by Module') +
theme_minimal() + theme(legend.position='none'))
rm(plot_data2, mean_and_sd)
For under-expressed genes, the relation between LFC and probability got inverted. The difference is quite big
For over-expressed genes, the trend didn’t change, it just got translated higher. Now in general, over-expressed genes have higher probabilities than over-expressed genes
plot_data = negative_set %>% mutate(ID=rownames(test_set)[!test_set$SFARI]) %>%
left_join(DE_info %>% data.frame %>% mutate(ID=rownames(.)), by='ID')
plot_data %>% ggplot(aes(log2FoldChange, corrected_prob)) + geom_point(alpha=0.1, color='#0099cc') +
geom_smooth(method='loess', color='gray', alpha=0.1) +
geom_line(stat='smooth', method='loess', color='gray', alpha=0.4, size=1.5, aes(y=prob)) +
xlab('LFC') + ylab('Corrected Probability') +
theme_minimal() + ggtitle('LFC vs model probability by gene')
The trends for DE genes don’t seem to have changed much, they were just translated, upwards for over-expressed genes, and downwards for under-expressed genes
The big difference we see in the plot above is just the connection between the patterns for under and over-expressed genes
The group that seem to be affected the least is the non-DE group of under-expressed genes
p1 = plot_data %>% filter(log2FoldChange<0) %>% mutate(DE = padj<0.05) %>%
ggplot(aes(log2FoldChange, corrected_prob, color=DE)) + geom_point(alpha=0.1) +
geom_smooth(method='loess', alpha=0.1) + xlab('') + ylab('Corrected Probability') +
ylim(c(min(plot_data$corrected_prob), max(plot_data$corrected_prob))) +
geom_line(stat='smooth', method='loess', alpha=0.4, size=1.5, aes(y=prob, color = DE)) +
theme_minimal() + theme(legend.position = 'none', plot.margin=unit(c(1,-0.3,1,1), 'cm'))
p2 = plot_data %>% filter(log2FoldChange>=0) %>% mutate(DE = padj<0.05) %>%
ggplot(aes(log2FoldChange, corrected_prob, color=DE)) + geom_point(alpha=0.1) +
geom_smooth(method='loess', alpha=0.1) + xlab('') + ylab('Corrected Probability') + ylab('') +
scale_y_continuous(position = 'right',
limits = c(min(plot_data$corrected_prob), max(plot_data$corrected_prob))) +
geom_line(stat='smooth', method = 'loess', alpha=0.4, size=1.5, aes(y = prob, color = DE)) +
theme_minimal() + theme(plot.margin = unit(c(1,1,1,-0.3), 'cm'), axis.ticks.y = element_blank())
grid.arrange(p1, p2, nrow=1, top = 'LFC vs model probability by gene', bottom = 'LFC')
rm(p1, p2)
The Probabilitys decreased for modules with negative correlation and increased for modules with positive correlation
module_score = negative_set %>% mutate(ID=rownames(test_set)[!test_set$SFARI]) %>%
left_join(biased_predictions %>% dplyr::select(ID, gene.score), by='ID') %>%
left_join(assigned_module, by = 'ID') %>%
dplyr::select(ID, prob, corrected_prob, Module, MTcor) %>%
left_join(data.frame(MTcor=unique(dataset$MTcor)) %>% arrange(by=MTcor) %>%
mutate(order=1:length(unique(dataset$MTcor))), by='MTcor')
ggplotly(module_score %>% ggplot(aes(MTcor, corrected_prob)) +
geom_point(color=module_score$Module, aes(id=ID, alpha=corrected_prob^4)) +
geom_hline(yintercept=mean(module_score$corrected_prob), color='gray', linetype='dotted') +
geom_line(stat='smooth', method = 'loess', color='gray', alpha=0.5, size=1.5, aes(x=MTcor, y=prob)) +
geom_smooth(color='gray', method = 'loess', se = FALSE, alpha=0.3) + theme_minimal() +
xlab('Module-Diagnosis correlation') + ylab('Corrected Probability'))
This bias correction makes bigger changes in the distribution of the probabilities than the post-processing one. Its main effect seems to be to reduce the importance of the under-expressed genes and increase the importance of over-expressed genes
In general the performance metrics decrease, but this isn’t necessarily wrong, since we knew part of the good performance of the model was because of the confounding factor related to mean level of expression, so it was expected for the performance of the model to decrease once we have removed this signal
write.csv(test_set, file='./../Data/RM_weighting_bias_correction.csv', row.names = TRUE)
sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
## [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
## [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] expss_0.10.2 kableExtra_1.1.0 MLmetrics_1.1.1 car_3.0-7
## [5] carData_3.0-3 ROCR_1.0-7 gplots_3.0.3 DMwR_0.4.1
## [9] caret_6.0-86 lattice_0.20-41 mgcv_1.8-31 nlme_3.1-147
## [13] reshape2_1.4.4 ggpubr_0.2.5 magrittr_1.5 RColorBrewer_1.1-2
## [17] gridExtra_2.3 viridis_0.5.1 viridisLite_0.3.0 plotly_4.9.2
## [21] knitr_1.28 forcats_0.5.0 stringr_1.4.0 dplyr_1.0.0
## [25] purrr_0.3.4 readr_1.3.1 tidyr_1.1.0 tibble_3.0.1
## [29] ggplot2_3.3.2 tidyverse_1.3.0
##
## loaded via a namespace (and not attached):
## [1] readxl_1.3.1 backports_1.1.8
## [3] Hmisc_4.4-0 plyr_1.8.6
## [5] lazyeval_0.2.2 splines_3.6.3
## [7] crosstalk_1.1.0.1 BiocParallel_1.18.1
## [9] GenomeInfoDb_1.20.0 digest_0.6.25
## [11] foreach_1.5.0 htmltools_0.4.0
## [13] gdata_2.18.0 fansi_0.4.1
## [15] memoise_1.1.0 checkmate_2.0.0
## [17] cluster_2.1.0 openxlsx_4.1.4
## [19] annotate_1.62.0 recipes_0.1.10
## [21] modelr_0.1.6 gower_0.2.1
## [23] matrixStats_0.56.0 xts_0.12-0
## [25] jpeg_0.1-8.1 colorspace_1.4-1
## [27] blob_1.2.1 rvest_0.3.5
## [29] haven_2.2.0 xfun_0.12
## [31] crayon_1.3.4 RCurl_1.98-1.2
## [33] jsonlite_1.7.0 genefilter_1.66.0
## [35] survival_3.1-12 zoo_1.8-8
## [37] iterators_1.0.12 glue_1.4.1
## [39] gtable_0.3.0 ipred_0.9-9
## [41] zlibbioc_1.30.0 XVector_0.24.0
## [43] webshot_0.5.2 DelayedArray_0.10.0
## [45] shape_1.4.4 BiocGenerics_0.30.0
## [47] quantmod_0.4.17 abind_1.4-5
## [49] scales_1.1.1 DBI_1.1.0
## [51] Rcpp_1.0.4.6 xtable_1.8-4
## [53] htmlTable_1.13.3 bit_1.1-15.2
## [55] foreign_0.8-76 Formula_1.2-3
## [57] stats4_3.6.3 lava_1.6.7
## [59] prodlim_2019.11.13 glmnet_3.0-2
## [61] htmlwidgets_1.5.1 httr_1.4.1
## [63] acepack_1.4.1 ellipsis_0.3.1
## [65] farver_2.0.3 XML_3.99-0.3
## [67] pkgconfig_2.0.3 nnet_7.3-14
## [69] dbplyr_1.4.2 locfit_1.5-9.4
## [71] labeling_0.3 AnnotationDbi_1.46.1
## [73] tidyselect_1.1.0 rlang_0.4.6
## [75] munsell_0.5.0 cellranger_1.1.0
## [77] tools_3.6.3 cli_2.0.2
## [79] RSQLite_2.2.0 generics_0.0.2
## [81] broom_0.5.5 evaluate_0.14
## [83] yaml_2.2.1 bit64_0.9-7
## [85] ModelMetrics_1.2.2.2 fs_1.4.0
## [87] zip_2.0.4 caTools_1.18.0
## [89] xml2_1.2.5 compiler_3.6.3
## [91] rstudioapi_0.11 png_0.1-7
## [93] curl_4.3 ggsignif_0.6.0
## [95] reprex_0.3.0 geneplotter_1.62.0
## [97] stringi_1.4.6 highr_0.8
## [99] Matrix_1.2-18 vctrs_0.3.1
## [101] pillar_1.4.4 lifecycle_0.2.0
## [103] data.table_1.12.8 bitops_1.0-6
## [105] GenomicRanges_1.36.1 latticeExtra_0.6-29
## [107] R6_2.4.1 KernSmooth_2.23-17
## [109] rio_0.5.16 IRanges_2.18.3
## [111] codetools_0.2-16 MASS_7.3-51.6
## [113] gtools_3.8.2 assertthat_0.2.1
## [115] SummarizedExperiment_1.14.1 DESeq2_1.24.0
## [117] withr_2.2.0 S4Vectors_0.22.1
## [119] GenomeInfoDbData_1.2.1 parallel_3.6.3
## [121] hms_0.5.3 rpart_4.1-15
## [123] timeDate_3043.102 class_7.3-17
## [125] rmarkdown_2.1 TTR_0.23-6
## [127] pROC_1.16.2 base64enc_0.1-3
## [129] Biobase_2.44.0 lubridate_1.7.4